Abstract

In the Shortest Superstring problem we are given a set of strings $S = \{s_1, \ldots, s_n\}$
and an integer $\ell$, and the question is to decide whether there is a superstring $s$ of length at most
$\ell$ containing all strings of $S$ as substrings. We obtain several parameterized algorithms and
complexity results for this problem.

In particular, we give an algorithm that in time $2^{O(k)}\,\mathrm{poly}(n)$ finds a superstring of length
at most $\ell$ containing at least $k$ strings of $S$. We complement this with a lower bound showing
that this parameterization does not admit a polynomial kernel under a complexity
assumption. We also obtain several results about the "below guaranteed values" parameterization of
the problem. We show that parameterization by compression admits a polynomial kernel, while
parameterization "below matching" is hard.


1 Introduction 

We consider the SHORTEST SUPERSTRING problem defined as follows: 


Shortest Superstring

Input: A set of $n$ strings $S = \{s_1, \ldots, s_n\}$ over an alphabet $\Sigma$ and a non-negative integer $\ell$.

Question: Is there a string $s$ of length at most $\ell$ containing all strings from $S$ as substrings?

This is a well-known NP-complete problem with a range of practical applications, from DNA
assembly [8] to data compression. For this reason, approximation algorithms for it are widely
studied. The best known approximation guarantee, $2\frac{11}{23}$, is due to Mucha [18]. At the same
time, the best known exact algorithms run in roughly $2^n$ steps and have been known for more than
50 years. More precisely, using known algorithms for the TRAVELING SALESMAN problem, SHORTEST
SUPERSTRING can be solved either in time $O^*(2^n)$ and the same space by dynamic programming
over subsets, or in time $O^*(2^n)$ and only polynomial space by inclusion-exclusion [15] (here,
$O^*(\cdot)$ hides factors that are polynomial in the input length, i.e., in $\sum_{i=1}^{n} |s_i|$). Such
algorithms can only be used in practice to solve instances of very moderate size. Stronger upper
bounds are known for the special case when the input strings have bounded length. There are
heuristic methods for solving TRAVELING SALESMAN, and hence also SHORTEST SUPERSTRING, that are
efficient in practice but have no provable guarantee on the running time.
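The dynamic programming over subsets mentioned above can be illustrated with a short sketch. It assumes the standard Held-Karp style recurrence over pairs (set of strings used, last string); the function names are illustrative and not taken from the paper, and the code returns only the optimal length.

```python
def overlap_len(a: str, b: str) -> int:
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring_length(strings):
    """Length of a shortest superstring, using O*(2^n) time and space."""
    # Duplicates and strings contained in other strings never change the answer.
    uniq = list(dict.fromkeys(strings))
    s = [x for x in uniq if not any(x != y and x in y for y in uniq)]
    n = len(s)
    if n == 0:
        return 0
    ov = [[overlap_len(s[i], s[j]) if i != j else 0 for j in range(n)]
          for i in range(n)]
    inf = float("inf")
    # best[mask][j]: minimum length of an overlap-concatenation of the strings
    # whose bits are set in mask, taken in some order that ends with s[j].
    best = [[inf] * n for _ in range(1 << n)]
    for j in range(n):
        best[1 << j][j] = len(s[j])
    for mask in range(1, 1 << n):
        for j in range(n):
            cur = best[mask][j]
            if cur == inf:
                continue
            for nxt in range(n):
                if mask & (1 << nxt):
                    continue
                cand = cur + len(s[nxt]) - ov[j][nxt]
                new_mask = mask | (1 << nxt)
                if cand < best[new_mask][nxt]:
                    best[new_mask][nxt] = cand
    return min(best[(1 << n) - 1])
```

The table has $2^n \cdot n$ entries, which is why this approach is limited to instances of moderate size.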

In this paper, we study the SHORTEST SUPERSTRING problem from the parameterized complexity
point of view. This field studies the complexity of computational problems with respect not
only to the input size but also to additional parameters, and tries to identify parameters of input
instances that make the problem tractable. Interestingly, prior to our work, apart from observations
following from the known reductions to TRAVELING SALESMAN, not much was known about the
parameterized complexity of SHORTEST SUPERSTRING. We refer to the survey of Bulteau et al. [5] for
an overview of known results on parameterized algorithms and complexity of string problems.
Thus our work can be seen as a first non-trivial step towards the study of this interesting and
important problem from the perspective of parameterized complexity.

Our results. In this paper we study two types of parameterization for SHORTEST SUPERSTRING
and present two kinds of results. The first set of results concerns the "natural" parameterization of
the problem. We consider the following generalization of SHORTEST SUPERSTRING:

Partial Superstring

Input: A collection (multiset) of strings $S$ over an alphabet $\Sigma$, and non-negative integers $k, \ell$.

Question: Is there a string $s$ of length at most $\ell$ such that $s$ is a superstring of a collection of
at least $k$ strings $S' \subseteq S$?

If $k = |S|$, then this is SHORTEST SUPERSTRING. Notice that $S$ can contain copies of the same
string, and a string of $S$ can be a substring of another string of the collection. For SHORTEST
SUPERSTRING, such cases could be easily avoided, but for PARTIAL SUPERSTRING it is natural to
assume that we have such possibilities.

Here we show that PARTIAL SUPERSTRING is fixed-parameter tractable (FPT) when parameterized
by $k$ or $\ell$. We complement this result by showing that it is unlikely that the problem admits
a polynomial kernel with respect to these parameters.

The second set of results concerns the "below guaranteed value" parameterization. Note that an
obvious (non-optimal) superstring of $S = \{s_1, \ldots, s_n\}$ is a string of length $\sum_{i=1}^{n} |s_i|$ formed by
concatenating all strings from $S$. For a superstring $s$ of $S$, the value $\sum_{i=1}^{n} |s_i| - |s|$ is called the
compression of $s$ with respect to $S$. Then finding a shortest superstring is equivalent to finding an
order of $s_1, \ldots, s_n$ such that the consecutive strings have the largest possible total overlap. We first
show that it is FPT with respect to $r$ to check whether one can achieve a compression of at least
$r$, by constructing a kernel of size $O(r^4)$. We complement this result by a hardness result about a
"stronger" parameterization. Let us partition the $n$ input strings into $n/2$ pairs such that the sum
of the $n/2$ resulting overlaps is maximized. Such a partition can be found in polynomial time by
constructing a maximum weight matching in an auxiliary graph. Then this total overlap provides
a lower bound on the maximum compression (or, equivalently, an upper bound on the length of a
shortest superstring). We show that deciding whether at least one additional symbol can
be saved beyond the maximum weight matching value is already NP-complete.

2 Basic definitions and preliminaries 

Strings. Let $s$ be a string. By $|s|$ we denote the length of $s$. By $s[i]$, where $1 \le i \le |s|$, we
denote the $i$-th symbol of $s$, and $s[i,j] = s[i] \ldots s[j]$ for $1 \le i \le j \le |s|$. We assume that $s[i,j]$
is the empty string if $i > j$. We denote by $\mathrm{prefix}_i(s) = s[1,i]$ and $\mathrm{suffix}_i(s) = s[|s| - i + 1, |s|]$ the
$i$-th prefix and $i$-th suffix of $s$, respectively, for $i \in \{1, \ldots, |s|\}$; $\mathrm{prefix}_0(s) = \mathrm{suffix}_0(s)$ is the empty
string. Let $s, s'$ be strings. We write $s \subseteq s'$ to denote that $s$ is a substring of $s'$. If $s \subseteq s'$, then
$s'$ is a superstring of $s$. We write $s \subset s'$ and $s \supset s'$ to denote proper sub- and superstrings. For a
collection of strings $S$, a string $s$ is a superstring of $S$ if $s$ is a superstring of each string in $S$. The
compression measure of a superstring $s$ of a collection of strings $S$ is $\sum_{x \in S} |x| - |s|$. If $s \subseteq s'$, then
$\mathrm{overlap}(s, s') = \mathrm{overlap}(s', s) = s$; otherwise, if $s \not\subseteq s'$ and $s' \not\subseteq s$, then $\mathrm{overlap}(s, s') = \mathrm{suffix}_r(s) =
\mathrm{prefix}_r(s')$, where $r = \max\{i \mid 0 \le i \le \min\{|s|, |s'|\},\ \mathrm{suffix}_i(s) = \mathrm{prefix}_i(s')\}$. We denote by $ss'$ the
concatenation of $s$ and $s'$. For strings $s, s'$, we define the concatenation with overlap $s \circ s'$ as follows.
If $s \subseteq s'$, then $s \circ s' = s' \circ s = s'$. If $s \not\subseteq s'$ and $s' \not\subseteq s$, then $s \circ s' = \mathrm{prefix}_p(s)\,\mathrm{overlap}(s, s')\,\mathrm{suffix}_q(s')$,
where $p = |s| - |\mathrm{overlap}(s, s')|$ and $q = |s'| - |\mathrm{overlap}(s, s')|$.
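The two definitions above translate almost directly into code. The following is a minimal sketch; the function names `overlap` and `merge` are chosen here for illustration, with `merge(s, t)` playing the role of $s \circ t$.

```python
def overlap(s: str, t: str) -> str:
    """overlap(s, t): the shorter string if one contains the other, otherwise
    the longest suffix of s that is also a prefix of t."""
    if s in t:
        return s
    if t in s:
        return t
    r = max(i for i in range(min(len(s), len(t)) + 1)
            if s[len(s) - i:] == t[:i])
    return t[:r]


def merge(s: str, t: str) -> str:
    """Concatenation with overlap of s and t."""
    if s in t:
        return t
    if t in s:
        return s
    # prefix_p(s) + overlap(s, t) is s itself, so only the part of t after
    # the overlap needs to be appended.
    return s + t[len(overlap(s, t)):]
```

For instance, merging "abcd" and "cdef" gives "abcdef", which saves two symbols compared with plain concatenation; this saving is exactly the contribution of the pair to the compression measure.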

We need the following folklore property of superstrings. 

Lemma 1. Let $s$ be a superstring of a collection $S$ of strings. Let $S' = \{s_1, \ldots, s_n\}$ be a set
of inclusion maximal pairwise distinct strings of $S$ such that each string of $S$ is a substring of a
string from $S'$. Let also $s_i = s[p_i, q_i]$ for $i \in \{1, \ldots, n\}$ and assume that $p_1 < \cdots < p_n$. Then
$s' = s_1 \circ \cdots \circ s_n$ is a superstring of $S$ of length at most $|s|$.

Graphs. We consider finite directed and undirected graphs without loops or multiple edges. The
vertex set of a (directed) graph $G$ is denoted by $V(G)$; the edge set of an undirected graph and the
arc set of a directed graph $G$ are denoted by $E(G)$. To distinguish edges and arcs, the edge with
two end-vertices $u, v$ is denoted by $\{u, v\}$, and we write $(u, v)$ for the corresponding arc. For an arc
$e = (u, v)$, $v$ is the head of $e$ and $u$ is the tail. Let $G$ be a directed graph. For a vertex $v \in V(G)$,
we say that $u$ is an in-neighbor of $v$ if $(u, v) \in E(G)$. The set of all in-neighbors of $v$ is denoted by
$N^-_G(v)$. The in-degree is $d^-_G(v) = |N^-_G(v)|$. Respectively, $u$ is an out-neighbor of $v$ if $(v, u) \in E(G)$,
the set of all out-neighbors of $v$ is denoted by $N^+_G(v)$, and the out-degree is $d^+_G(v) = |N^+_G(v)|$. For
a directed graph $G$, a (directed) trail of length $k$ is a sequence $v_0, e_1, v_1, e_2, \ldots, e_k, v_k$ of vertices
and arcs of $G$ such that $v_0, \ldots, v_k \in V(G)$, $e_1, \ldots, e_k \in E(G)$, the arcs $e_1, \ldots, e_k$ are pairwise
distinct, and for $i \in \{1, \ldots, k\}$, $e_i = (v_{i-1}, v_i)$. We omit the word "directed" if it does not create
confusion. Slightly abusing notation, we often write a trail as a sequence of its vertices $v_0, \ldots, v_k$ or
arcs $e_1, \ldots, e_k$. If $v_0, \ldots, v_k$ are pairwise distinct, then $v_0, \ldots, v_k$ is a (directed) path. Recall that a
path of length $|V(G)| - 1$ is a Hamiltonian path. For an undirected graph $G$, a set $U \subseteq V(G)$ is
a vertex cover of $G$ if for any edge $\{u, v\}$ of $G$, $u \in U$ or $v \in U$. A set of edges $M$ with pairwise
distinct end-vertices is a matching.

We consider the following auxiliary problem: 

Long Trail

Input: A directed graph $G$ and a non-negative integer $\ell$.

Question: Is there a trail of length at least $\ell$ in $G$?


Lemma 2. LONG TRAIL is NP-complete. In particular, the problem is NP-complete if $\ell =
|V(G)| - 1$.

Proof. We reduce from the HAMILTONIAN PATH problem for directed graphs, which is well known to be
NP-complete. Let $G$ be a directed graph with $n$ vertices. We construct the graph
$G'$ as follows.

• For each $v \in V(G)$, construct two vertices $v^-, v^+$ and an arc $(v^-, v^+)$.

• For each $(u, v) \in E(G)$, construct an arc $(u^+, v^-)$.

• Construct two vertices $s, t$ and for each $v \in V(G)$, construct arcs $(s, v^-)$, $(v^+, t)$.
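The three construction steps can be sketched as follows. The representation is an assumption made for illustration: $v^-$ and $v^+$ are encoded as pairs `(v, "-")` and `(v, "+")`, the extra vertices as the strings `"s"` and `"t"`, and the function name `split_graph` is hypothetical.

```python
def split_graph(vertices, arcs):
    """Build G' from a directed graph G given as a vertex list and arc list."""
    new_vertices = {"s", "t"}
    new_arcs = set()
    for v in vertices:
        new_vertices.add((v, "-"))
        new_vertices.add((v, "+"))
        new_arcs.add(((v, "-"), (v, "+")))   # arc (v^-, v^+)
        new_arcs.add(("s", (v, "-")))        # arc (s, v^-)
        new_arcs.add(((v, "+"), "t"))        # arc (v^+, t)
    for u, v in arcs:
        new_arcs.add(((u, "+"), (v, "-")))   # arc (u^+, v^-)
    return new_vertices, new_arcs


def trail_from_path(path):
    """Arcs of the trail s, v_1^-, v_1^+, ..., v_n^-, v_n^+, t in G'."""
    seq = ["s"]
    for v in path:
        seq += [(v, "-"), (v, "+")]
    seq.append("t")
    return list(zip(seq, seq[1:]))
```

With $n$ vertices, $G'$ has $2n + 2$ vertices, so a Hamiltonian path of $G$ yields a trail with $2n + 1$ arcs in $G'$.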

We claim that $G'$ has a trail of length at least $2n + 1 = |V(G')| - 1$ if and only if $G$ has a Hamiltonian
path.

Suppose that $G$ has a Hamiltonian path $v_1, \ldots, v_n$. Then the trail $s, v_1^-, v_1^+, v_2^-, v_2^+, \ldots, v_n^-, v_n^+, t$
in $G'$ has length $2n + 1$.

Assume that $G'$ has a trail $P$ of length at least $2n + 1$. Without loss of generality we can assume
that $s$ is the first vertex of $P$ and $t$ is the last. To see this, suppose that $x \ne s$ is the first vertex of
$P$. Notice that $s$ is not in $P$, because $d^-_{G'}(s) = 0$. If $x = v^-$ for some $v \in V(G)$, then we can consider
the extended trail $s, (s, x), P$. If $x = v^+$ for some $v \in V(G)$, then let $u^-$ be the next vertex in $P$ after
$x$. We consider the trail $P'$ obtained from $P$ by replacing $x$ and $(x, u^-)$ by $s$ and $(s, u^-)$,
respectively. Clearly, $P'$ has the same length as $P$. By symmetric arguments, we
can assume that $t$ is the last vertex of $P$. We have that every vertex of $G'$ occurs exactly once in $P$,
because $d^-_{G'}(s) = d^+_{G'}(t) = 0$ and $d^+_{G'}(v^-) = d^-_{G'}(v^+) = 1$ for $v \in V(G)$. Moreover, for each vertex
$v \in V(G)$, the arc $(v^-, v^+)$ is in $P$, because $v^-$ is the unique in-neighbor of $v^+$ and $v^+$ is the unique
out-neighbor of $v^-$. Hence, $P$ can be written as $s, v_1^-, v_1^+, \ldots, v_n^-, v_n^+, t$
for $v_1, \ldots, v_n \in V(G)$. It remains to observe that $v_1, \ldots, v_n$ is a Hamiltonian path in $G$. □

Parameterized Complexity. Parameterized complexity is a two-dimensional framework for
studying the computational complexity of a problem. One dimension is the input size and the
other is a parameter. We refer to the books of Downey and Fellows [6], Flum and Grohe [9],
and Niedermeier [20] for detailed introductions to parameterized complexity.

Formally, a parameterized problem is a set $\mathcal{P} \subseteq \Sigma^* \times \mathbb{N}$, where $\Sigma$ is a finite alphabet, i.e., an instance
of $\mathcal{P}$ is a pair $(I, k)$ for $I \in \Sigma^*$ and $k \in \mathbb{N}$, where $I$ is an input and $k$ is a parameter. A problem
is said to be fixed-parameter tractable (or FPT) if it can be solved in time $f(k) \cdot |I|^{O(1)}$ for some
function $f$. A kernelization for a parameterized problem is a polynomial-time algorithm that maps each
instance $(I, k)$ to an instance $(I', k')$ such that

i) $(I, k)$ is a yes-instance if and only if $(I', k')$ is a yes-instance of the problem, and

ii) the sizes of $I'$ and $k'$ are bounded by $f(k)$ for a computable function $f$.

The output $(I', k')$ is called a kernel. The function $f$ is said to be the size of the kernel. Respectively,
a kernel is polynomial if $f$ is polynomial. While a parameterized problem is FPT if and only if it
has a kernel, it is widely believed that not all FPT problems have polynomial kernels.

In particular, Bodlaender, Jansen and Kratsch [3] introduced techniques that make it possible to show
that a parameterized problem has no polynomial kernel unless $\mathrm{NP} \subseteq \mathrm{coNP}/\mathrm{poly}$.

Let $\Sigma$ be a finite alphabet. An equivalence relation $\mathcal{R}$ on the set of strings $\Sigma^*$ is called a
polynomial equivalence relation if the following two conditions hold:

i) there is an algorithm that, given two strings $x, y \in \Sigma^*$, decides whether $x$ and $y$ belong to the
same equivalence class in time polynomial in $|x| + |y|$,

ii) for any finite set $S \subseteq \Sigma^*$, the equivalence relation $\mathcal{R}$ partitions the elements of $S$ into a number
of classes that is polynomially bounded in the size of the largest element of $S$.

Let $L \subseteq \Sigma^*$ be a language, let $\mathcal{R}$ be a polynomial equivalence relation on $\Sigma^*$, and let $\mathcal{P} \subseteq \Sigma^* \times \mathbb{N}$
be a parameterized problem. An OR-cross-composition of $L$ into $\mathcal{P}$ (with respect to $\mathcal{R}$) is an
algorithm that, given $t$ instances $x_1, x_2, \ldots, x_t \in \Sigma^*$ of $L$ belonging to the same equivalence class of
$\mathcal{R}$, takes time polynomial in $\sum_{i=1}^{t} |x_i|$ and outputs an instance $(y, k) \in \Sigma^* \times \mathbb{N}$ such that:

i) the parameter value $k$ is polynomially bounded in $\max\{|x_1|, \ldots, |x_t|\} + \log t$,

ii) the instance $(y, k)$ is a yes-instance for $\mathcal{P}$ if and only if at least one instance $x_i$, $i \in \{1, \ldots, t\}$,
is a yes-instance for $L$.

It is said that $L$ OR-cross-composes into $\mathcal{P}$ if a cross-composition algorithm exists for a suitable
relation $\mathcal{R}$.

In particular, Bodlaender, Jansen and Kratsch [3] proved the following theorem.

Theorem 1 ([3]). If an NP-hard language $L$ OR-cross-composes into the parameterized problem $\mathcal{P}$,
then $\mathcal{P}$ does not admit a polynomial kernelization unless $\mathrm{NP} \subseteq \mathrm{coNP}/\mathrm{poly}$.

We use randomized algorithms for our problems. Recall that a Monte Carlo algorithm is a
randomized algorithm whose running time is deterministic, but whose output may be incorrect with a
certain (typically small) probability. A Monte Carlo algorithm is true-biased (false-biased,
respectively) if it always returns a correct answer when it returns a yes-answer (a no-answer, respectively).

3 FPT-algorithms for Partial Superstring 

In this section we show that PARTIAL SUPERSTRING is FPT when parameterized by $k$ or $\ell$. For
technical reasons, we consider the following variant of the problem with weights:

Partial Weighted Superstring

Input: A collection of strings $S$ over an alphabet $\Sigma$ with a weight function $w \colon S \to \mathbb{N}_0$, and
non-negative integers $k$, $\ell$ and $W$.

Question: Is there a string $s$ of length at most $\ell$ such that $s$ is a superstring of a collection of
$k$ strings $S' \subseteq S$ with $w(S') \ge W$?

Clearly, if $w \equiv 1$ and $W = k$, then we have the PARTIAL SUPERSTRING problem.

Theorem 2. PARTIAL WEIGHTED SUPERSTRING can be solved in time $O((2e)^k \cdot k n^2 m \log W)$ by
a true-biased Monte Carlo algorithm and in time $(2e)^k k^{O(\log k)} \cdot n^2 \log n \cdot m \log W$ by a deterministic
algorithm for a collection of $n$ strings of length at most $m$.

Proof. First, we describe the randomized algorithm and then explain how it can be derandomized.
The algorithm uses the color coding technique proposed by Alon, Yuster and Zwick [2].

If $\ell \ge km$, then the problem is trivial, as the concatenation of any $k$ strings of $S$ has length at
most $\ell$ and we can greedily choose $k$ strings of maximum weight. Assume that $\ell < km$.

We color the strings of $S$ with $k$ colors $1, \ldots, k$ uniformly at random and independently of each
other. Now we are looking for a string $s$ that is a superstring of $k$ strings of maximum total weight
that have pairwise distinct colors.

To do this, we apply dynamic programming over subsets. For simplicity, we explain only
how to solve the existence problem, but our algorithm can be modified to find a colorful superstring
as well. For $X \subseteq \{1, \ldots, k\}$, a string $x \in S$ and a positive integer $h \in \{1, \ldots, \ell\}$, the algorithm
computes the maximum weight $W(X, x, h)$ of a string $s$ of length at most $h$ such that

i) $s$ is a superstring of a collection of $k' = |X|$ strings $S' \subseteq S$ of pairwise distinct colors from $X$,

ii) $x$ is an inclusion maximal string of $S'$ and $x = \mathrm{suffix}_{|x|}(s)$.

If such a string $s$ does not exist, then $W(X, x, h) = -\infty$.

We compute the table of values of $W(X, x, h)$ consecutively for $|X| = 1, \ldots, k$. To simplify
computations, we assume that $W(X, x, h) = -\infty$ for $h < 0$. If $|X| = 1$, then for each string $x \in S$,
we set $W(X, x, h) = w(x)$ if $x$ is colored by the unique color of $X$ and $|x| \le h$. In all other cases
$W(X, x, h) = -\infty$. Assume that $|X| = k' \ge 2$ and the values of $W(X', x, h)$ are already computed
if $|X'| < k'$. Let

$W' = \max\{W(X \setminus \{c\}, x, h) + w(y) \mid y \subseteq x \text{ has color } c \in X\}$,

and

$W'' = \max\{W(X \setminus \{c\}, y, h - |x| + |\mathrm{overlap}(y, x)|) + w(x) \mid x \not\subseteq y,\ y \not\subseteq x\}$,

where, in the definition of $W''$, $c$ is the color of $x$; we assume that $W' = -\infty$ if there is no substring $y$
of $x$ of color $c \in X$, and $W'' = -\infty$ if every string $y$ is a sub- or superstring of $x$. We set
$W(X, x, h) = \max\{W', W''\}$.

We show that $\max\{W(\{1, \ldots, k\}, x, \ell) \mid x \in S\}$ is the maximum weight of $k$ strings of $S$ colored
by distinct colors that have a superstring of length at most $\ell$; if this value equals $-\infty$, then there
is no string of length at most $\ell$ that is a superstring of $k$ strings of $S$ of distinct colors.

To prove this, it is sufficient to show that the values $W(X, x, h)$ computed by the algorithm
are the maximum weights of strings of length at most $h$ that satisfy (i) and (ii). The proof is by
induction on $|X|$. It is straightforward to verify that the claim holds if $|X| = 1$. Assume that
$|X| > 1$ and the claim holds for sets of smaller size. Denote by $W^*(X, x, h)$ the maximum weight
of a string $s$ of length at most $h$ that satisfies (i) and (ii). By the description of the algorithm,
$W^*(X, x, h) \ge W(X, x, h)$. We show that $W^*(X, x, h) \le W(X, x, h)$.

Let $S'$ be a collection of $k'$ strings of pairwise distinct colors from $X$ that have $s$ as a superstring.
Denote by $S''$ a set of inclusion maximal distinct strings of $S'$ that contains $x$ such that every string of
$S'$ is a substring of a string of $S''$. Assume that $S'' = \{x_1, \ldots, x_r\}$ and $x_i = s[p_i, q_i]$ for $i \in \{1, \ldots, r\}$.
Clearly, $x = x_r$.

Suppose that there is $y \in S' \setminus \{x\}$ such that $y \subseteq x$. Let $c \in X$ be the color of $y$. Then $s$ is a
superstring of $S' \setminus \{y\}$ and the total weight of these strings is $W^*(X, x, h) - w(y)$. By induction,
$W^*(X, x, h) - w(y) \le W(X \setminus \{c\}, x, h)$ and we have that $W^*(X, x, h) \le W(X \setminus \{c\}, x, h) + w(y) \le
W' \le W(X, x, h)$.

Suppose now that $S' \setminus \{x\}$ does not contain substrings of $x$. Then $r \ge 2$. Let $y = x_{r-1}$ and $s' =
s[1, q_{r-1}]$. Observe that $y = \mathrm{suffix}_{|y|}(s')$. Notice that $s'$ is a superstring of $S'' \setminus \{x\}$. Because $S' \setminus \{x\}$ has
no substrings of $x$, every string in $S' \setminus \{x\}$ is a substring of any superstring of $S'' \setminus \{x\}$ and, therefore,
$s'$ is a superstring of $S' \setminus \{x\}$ of length at most $|s| - |x| + |\mathrm{overlap}(y, x)| \le h - |x| + |\mathrm{overlap}(y, x)|$.
The weight of $S' \setminus \{x\}$ is $W^*(X, x, h) - w(x)$. By induction, with $c$ being the color of $x$,
$W^*(X, x, h) - w(x) \le W(X \setminus \{c\}, y, h - |x| + |\mathrm{overlap}(y, x)|)$. Hence
$W^*(X, x, h) \le W(X \setminus \{c\}, y, h - |x| + |\mathrm{overlap}(y, x)|) + w(x) \le W'' \le W(X, x, h)$.

To evaluate the running time of the dynamic programming algorithm, observe that we can check
whether $y$ is a substring of $x$ or find $\mathrm{overlap}(y, x)$ in time $O(m)$ using, e.g., the algorithm of Knuth,
Morris, and Pratt [16], and we can construct the table of the overlaps and their sizes in time $O(n^2 m)$.
Hence, for each $X$, the values $W(X, x, h)$ can be computed in time $O(n^2 k m \log W)$, as $h \le \ell < km$.
Therefore, the running time is $O(2^k \cdot n^2 k m \log W)$.

We proved that an optimal colorful solution can be found in time $O(2^k \cdot n^2 k m \log W)$. Using
the standard color coding arguments, we obtain that it is sufficient to consider $N = e^k$
random colorings of $S$ to claim that with probability $\alpha > 0$, where $\alpha$ is a constant that does not
depend on the input size and the parameter, we get a coloring for which $k$ strings of $S$ that have
a superstring of length at most $\ell$ and total weight at least $W$ are colored by distinct colors,
if such strings exist. It implies that PARTIAL WEIGHTED SUPERSTRING can be solved in time
$O((2e)^k \cdot k n^2 m \log W)$ by our randomized algorithm.
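The randomized part of the argument can be illustrated with a simplified sketch. It is not the algorithm of the proof: it assumes unit weights and pairwise distinct input strings none of which is a substring of another, so the table is indexed only by (color set, last string) and stores a minimum length instead of a maximum weight per length budget. Colors are numbered from 0 to k-1, and all function names are illustrative.

```python
import random


def overlap_len(a: str, b: str) -> int:
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def colorful_min_length(strings, colors, k):
    """Minimum length of a superstring of k strings with pairwise distinct
    colors, or infinity if no such k strings exist under this coloring."""
    n = len(strings)
    inf = float("inf")
    # best[mask][j]: shortest overlap-concatenation of strings whose colors
    # form exactly the set mask and whose last string is strings[j].
    best = [[inf] * n for _ in range(1 << k)]
    for j in range(n):
        best[1 << colors[j]][j] = len(strings[j])
    for mask in range(1, 1 << k):
        for j in range(n):
            cur = best[mask][j]
            if cur == inf:
                continue
            for nxt in range(n):
                bit = 1 << colors[nxt]
                if mask & bit:
                    continue  # this color is already used
                cand = cur + len(strings[nxt]) - overlap_len(strings[j], strings[nxt])
                if cand < best[mask | bit][nxt]:
                    best[mask | bit][nxt] = cand
    return min(best[(1 << k) - 1], default=inf)


def partial_superstring(strings, k, ell, trials, seed=0):
    """True-biased Monte Carlo test: a True answer is always correct."""
    rng = random.Random(seed)
    for _ in range(trials):
        colors = [rng.randrange(k) for _ in strings]
        if colorful_min_length(strings, colors, k) <= ell:
            return True
    return False
```

A fixed set of $k$ strings receives pairwise distinct colors with probability $k!/k^k \ge e^{-k}$, which is why on the order of $e^k$ independent colorings suffice for a constant success probability.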

To derandomize the algorithm, we apply the technique proposed by Alon, Yuster and Zwick [2]
using the $k$-perfect hash functions constructed by Naor, Schulman and Srinivasan [19]. The random
colorings are replaced by a family of at most $e^k k^{O(\log k)} \log n$ hash functions $c \colon S \to \{1, \ldots, k\}$
that have the following property: there is a hash function $c$ that colors $k$ strings of $S$ that have a
superstring of length at most $\ell$ and total weight at least $W$ by distinct colors, if such strings
exist. It implies that PARTIAL WEIGHTED SUPERSTRING can be solved in time $(2e)^k k^{O(\log k)} \cdot
n^2 \log n \cdot m \log W$ deterministically. □

Because PARTIAL SUPERSTRING is a special case of PARTIAL WEIGHTED SUPERSTRING,
Theorem 2 implies that this problem is FPT when parameterized by $k$. We show that the same holds
if we parameterize the problem by $\ell$.

Corollary 1. PARTIAL SUPERSTRING is FPT when parameterized by $\ell$.

Proof. Consider an instance $(S, k, \ell)$ of PARTIAL SUPERSTRING. Recall that $S$ can contain several
copies of the same string. We construct a set of weighted strings $S'$ by replacing each string $s$ that
occurs $r$ times in $S$ by a single copy of $s$ of weight $w(s) = r$. Let $W = k$. Observe that there
is a string $s$ of length at most $\ell$ such that $s$ is a superstring of a collection of at least $k$ strings
of $S$ if and only if there is a string $s$ of length at most $\ell$ such that $s$ is a superstring of a set of
strings of $S'$ of total weight at least $W$. A string of length at most $\ell$ has at most $\ell(\ell + 1)/2$ distinct
non-empty substrings. We consider the instances $(S', w, k', \ell, W)$ of PARTIAL WEIGHTED SUPERSTRING for
$k' \in \{1, \ldots, \ell(\ell + 1)/2\}$. For each of these instances, we solve the problem using Theorem 2. It
remains to observe that there is a string $s$ of length at most $\ell$ such that $s$ is a superstring of a set
of strings of $S'$ of total weight at least $W$ if and only if one of the instances $(S', w, k', \ell, W)$ is a
yes-instance of PARTIAL WEIGHTED SUPERSTRING. □

We complement the above algorithmic results by showing that we can hardly expect PARTIAL
SUPERSTRING to have a polynomial kernel when parameterized by $k$ or $\ell$.

Theorem 3. PARTIAL SUPERSTRING does not admit a polynomial kernel when parameterized by
$k + m$ or $\ell + m$ for strings of length at most $m$ over the alphabet $\Sigma = \{0, 1\}$ unless $\mathrm{NP} \subseteq \mathrm{coNP}/\mathrm{poly}$.

Proof. We show that LONG TRAIL OR-cross-composes into PARTIAL SUPERSTRING. Recall
that LONG TRAIL was shown to be NP-complete in Lemma 2.

We assume that two instances $(G, \ell)$ and $(G', \ell')$ of LONG TRAIL are equivalent if $|V(G)| =
|V(G')|$ and $\ell = \ell'$. Consider equivalent instances $(G_i, \ell)$ of LONG TRAIL for $i \in \{1, \ldots, t\}$. Let
$V(G_i) = \{v^i_1, \ldots, v^i_n\}$ for $i \in \{1, \ldots, t\}$. Let $r = \max\{\lfloor \log n \rfloor, \lfloor \log t \rfloor\} + 2$. Denote by $x_i$ the string
of length $r$ that encodes a positive integer $i$ in binary for $i \le 2^r - 1$. Let $x^* = x_i$ for $i = 2^r - 1$,
i.e., $x^* = $ '1...1'. Notice that if $i \le \max\{n, t\}$, then the first symbol of $x_i$ is '0'. For each arc
$e = (v^i_p, v^i_q)$ of $G_i$, we construct the string $s_e = x_i x^* x_i x_p x_i x^* x_i x_q x_i x^* x_i$. Clearly, $|s_e| = 11r$. We
define

$S = \{s_e \mid e \in E(G_i),\ 1 \le i \le t\}$

and let $k = \ell$, $\ell' = 4r\ell + 7r$. We claim that there is $i \in \{1, \ldots, t\}$ such that $G_i$ has a trail of length
$\ell$ if and only if there is a string $s$ of length at most $\ell'$ that is a superstring of $k$ strings of $S$.
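The encoding of arcs as binary strings can be made concrete with a short sketch. It assumes that the logarithms are taken base 2; the helper names `encode`, `block_length` and `arc_string` are illustrative.

```python
def encode(i: int, r: int) -> str:
    """Binary encoding of i, padded with leading zeros to length r."""
    return format(i, "b").zfill(r)


def block_length(n: int, t: int) -> int:
    """r = max(floor(log2 n), floor(log2 t)) + 2 for n, t >= 1."""
    # For v >= 1, floor(log2 v) equals v.bit_length() - 1.
    return max(n.bit_length() - 1, t.bit_length() - 1) + 2


def arc_string(i: int, p: int, q: int, r: int) -> str:
    """The string s_e for the arc e = (v_p, v_q) of the i-th graph."""
    x_i, x_p, x_q = encode(i, r), encode(p, r), encode(q, r)
    star = "1" * r  # x*, the encoding of 2^r - 1
    return x_i + star + x_i + x_p + x_i + star + x_i + x_q + x_i + star + x_i
```

For two consecutive arcs $(v_p, v_q)$ and $(v_q, v_z)$ of the same graph, the last $7r$ symbols of the first string coincide with the first $7r$ symbols of the second, so each additional arc of a trail adds only $4r$ symbols. This matches the bound $11r\ell - 7r(\ell - 1) = 4r\ell + 7r$.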

Suppose that there is $i \in \{1, \ldots, t\}$ such that $G_i$ has a trail $e_1, \ldots, e_\ell$. Consider $s = s_{e_1} \circ \cdots \circ s_{e_\ell}$.
Because the length of each $s_{e_j}$ is $11r$ and $|\mathrm{overlap}(s_{e_{j-1}}, s_{e_j})| \ge 7r$, we obtain that $|s| \le 11r\ell - 7r(\ell -
1) = \ell'$. Hence, $s$ is a string of length at most $\ell'$ that is a superstring of $k = \ell$ strings.

Assume now that there is a string $s$ of length at most $\ell'$ that is a superstring of $k$ strings of $S$.
Because no string of $S$ is a substring of another one, we can assume that $s = s_{e_1} \circ \cdots \circ s_{e_k}$ for some
$e_1, \ldots, e_k \in E(G_1) \cup \cdots \cup E(G_t)$ by Lemma 1. We use the following properties of the overlap of two
strings $s_e, s_{e'} \in S$. Recall that if $e = (v^i_p, v^i_q)$ is an arc of $G_i$, then $s_e = x_i x^* x_i x_p x_i x^* x_i x_q x_i x^* x_i$, $x^* = $ '1...1'
and the first symbol of $x_i$ is '0'. It implies that $|\mathrm{overlap}(s_e, s_{e'})| \le 7r$, and $|\mathrm{overlap}(s_e, s_{e'})| = 7r$ if and
only if $e, e' \in E(G_i)$ for some $i \in \{1, \ldots, t\}$ and $e = (v^i_p, v^i_q)$, $e' = (v^i_q, v^i_z)$ for some $p, q, z \in \{1, \ldots, n\}$.
Since $|s| \le \ell' = 4r\ell + 7r$ and $k = \ell$, $|\mathrm{overlap}(s_{e_{j-1}}, s_{e_j})| = 7r$ for $j \in \{2, \ldots, k\}$. Hence, $e_1, \ldots, e_k$ is
a trail in some $G_i$.

It remains to observe that $k + m = O(n + \log t)$ and $\ell' + m = O((n + \log t)^2)$ to complete the
proof. □

4 Shortest Superstring below guaranteed values 

In this section we discuss SHORTEST SUPERSTRING parameterized by the difference between upper
bounds for the length of a shortest superstring and the length of a solution superstring. For a
collection of strings $S$, the length of the shortest superstring is trivially upper bounded by $\sum_{x \in S} |x|$.

We show that SHORTEST SUPERSTRING admits a polynomial kernel when parameterized by the
compression measure of a solution.

Theorem 4. SHORTEST SUPERSTRING admits a kernel of size $O(r^4)$ when parameterized by $r =
\sum_{x \in S} |x| - \ell$.

Proof. Let $(S, \ell)$ be an instance of SHORTEST SUPERSTRING, $r = \sum_{x \in S} |x| - \ell$. First, we apply the
following reduction rules for the instance.

Rule 1. If there are distinct elements $x$ and $y$ of $S$ such that $x \subseteq y$, then delete $x$ and set $r = r - |x|$.
If $r \le 0$, then return a yes-answer and stop.

Rule 2. If there is $x \in S$ such that for any $y \in S \setminus \{x\}$, $|\mathrm{overlap}(x, y)| = |\mathrm{overlap}(y, x)| = 0$, then
delete $x$ and set $\ell = \ell - |x|$. If $S = \emptyset$ and $\ell \ge 0$, then return a yes-answer and stop. If $\ell < 0$, then
return a no-answer and stop.

Rule 3. If there are distinct elements $x$ and $y$ of $S$ such that $|\mathrm{overlap}(x, y)| \ge r$, then return a
yes-answer and stop.

It is straightforward to verify that these rules are safe, i.e., by the application of a rule we either
solve the problem or obtain an equivalent instance. We exhaustively apply Rules 1-3. To simplify
notation, we assume that $S$ is the obtained set of strings and $\ell$ and $r$ are the obtained values of the
parameters. Notice that all strings in $S$ are distinct and no string is a substring of another. Our
next aim is to bound the lengths of the considered strings.

Rule 4. If there is $x \in S$ with $|x| > 2r$, then set $\ell = \ell - |x| + 2r$ and $x = \mathrm{prefix}_r(x)\,\mathrm{suffix}_r(x)$. If
$\ell < 0$, then return a no-answer and stop.


To see that the rule is safe, recall that $x$ is not a sub- or superstring of any other string of $S$,
and $|\mathrm{overlap}(x, y)| < r$ and $|\mathrm{overlap}(y, x)| < r$ for any $y \in S$ distinct from $x$ after the applications
of Rule 3. As before, we apply Rule 4 exhaustively.

Now we construct an auxiliary graph $G$ with the vertex set $S$ such that two distinct $x, y \in S$ are
adjacent in $G$ if and only if $|\mathrm{overlap}(x, y)| > 0$ or $|\mathrm{overlap}(y, x)| > 0$. We greedily select a maximal
matching $M$ in $G$ and apply the following rule.

Rule 5. If $|M| \ge r$, then return a yes-answer and stop.

To show that the rule is safe, it is sufficient to observe that if $M = \{\{x_1, y_1\}, \ldots, \{x_h, y_h\}\}$,
$|\mathrm{overlap}(x_i, y_i)| > 0$ for $i \in \{1, \ldots, h\}$ and $h \ge r$, then the string $s$ obtained by the consecutive
concatenations with overlaps of $x_1, y_1, \ldots, x_h, y_h$ and then all the other strings of $S$ in arbitrary
order has compression measure at least $r$.

Assume from now on that we do not stop here, i.e., $|M| \le r - 1$. Let $X \subseteq S$ be the set of
end-vertices of the edges of $M$ and $Y = S \setminus X$. Let $X = \{x_1, \ldots, x_h\}$. Clearly, $h \le 2(r - 1)$. Observe
that $X$ is a vertex cover of $G$ and $Y$ is an independent set of $G$.
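The greedy selection of a maximal matching and the resulting sets $X$ and $Y$ can be sketched as follows. The sketch assumes that Rules 1-4 have already been applied, so the strings are distinct and none is a substring of another; strings are referred to by their indices, and the function names are illustrative.

```python
def overlap_len(a: str, b: str) -> int:
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_maximal_matching(strings):
    """Return (M, X, Y): a maximal matching of the overlap graph as index
    pairs, the matched indices X, and the unmatched indices Y."""
    n = len(strings)

    def adjacent(i, j):
        return (overlap_len(strings[i], strings[j]) > 0
                or overlap_len(strings[j], strings[i]) > 0)

    matched = set()
    matching = []
    for i in range(n):
        if i in matched:
            continue
        for j in range(i + 1, n):
            if j not in matched and adjacent(i, j):
                matching.append((i, j))
                matched.update((i, j))
                break
    x = sorted(matched)
    y = [i for i in range(n) if i not in matched]
    return matching, x, y
```

Because the matching is maximal, no two unmatched strings overlap, which is the independence property of $Y$ used in the rest of the proof.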

For each ordered pair $(i, j)$ of distinct $i, j \in \{1, \ldots, h\}$, find an ordering $y_1, \ldots, y_t$ of the elements
of $Y$ sorted by decreasing $|\mathrm{overlap}(x_i, y_p)| + |\mathrm{overlap}(y_p, x_j)|$ for $p \in \{1, \ldots, t\}$. We construct
the set $R_{(i,j)}$ that contains the first $\min\{2h, t\}$ elements of the sequence.

For each $i \in \{1, \ldots, h\}$, find an ordering $y_1, \ldots, y_t$ of the elements of $Y$ sorted by decreasing
$|\mathrm{overlap}(y_p, x_i)|$ for $p \in \{1, \ldots, t\}$. We construct the set $S_i$ that contains the first $\min\{2h, t\}$
elements of the sequence.

For each $i \in \{1, \ldots, h\}$, find an ordering $y_1, \ldots, y_t$ of the elements of $Y$ sorted by decreasing
$|\mathrm{overlap}(x_i, y_p)|$ for $p \in \{1, \ldots, t\}$. We construct the set $T_i$ that contains the first $\min\{2h, t\}$
elements of the sequence.

Let

$S' = X \cup \Big(\bigcup_{(i,j)} R_{(i,j)}\Big) \cup \Big(\bigcup_{i} S_i\Big) \cup \Big(\bigcup_{i} T_i\Big)$.

Claim (*). There is a superstring $s$ of $S$ with compression measure at least $r$ if and only if
there is a superstring $s'$ of $S'$ with compression measure at least $r$.

Proof of Claim (*). If $s'$ is a superstring of $S'$ with compression measure at least $r$, then the
string $s$ obtained from $s'$ by the concatenation of $s'$ and the strings of $S \setminus S'$ (in any order) is a
superstring of $S$ with the same compression measure as $s'$.

Suppose that $s$ is a shortest superstring of $S$ and its compression measure is at least $r$. By
Lemma 1, $s = s_1 \circ \cdots \circ s_n$, where $S = \{s_1, \ldots, s_n\}$. Let

$Z = \{s_i \mid s_i \in Y,\ |\mathrm{overlap}(s_{i-1}, s_i)| > 0 \text{ or } |\mathrm{overlap}(s_i, s_{i+1})| > 0,\ 1 \le i \le n\}$;

we assume that $s_0, s_{n+1}$ are empty strings.

We show that $|Z| \le 2h$. Suppose that $s_i \in Z$. If $|\mathrm{overlap}(s_{i-1}, s_i)| > 0$, then $s_{i-1} \in X$,
because $s_i \in Y$ and any two strings of $Y$ have empty overlap. By the same arguments, if
$|\mathrm{overlap}(s_i, s_{i+1})| > 0$, then $s_{i+1} \in X$. Because $|X| = h$, we have that $|Z| \le 2h$.

Suppose that the shortest superstring $s$ is chosen in such a way that $|Z \setminus S'|$ is minimum. We
prove that $Z \subseteq S'$ in this case. To obtain a contradiction, assume that there is $s_i \in Z \setminus S'$. We
consider three cases.

Case 1. $|\mathrm{overlap}(s_{i-1}, s_i)| > 0$ and $|\mathrm{overlap}(s_i, s_{i+1})| > 0$. Recall that $s_{i-1}, s_{i+1} \in X$ in this case.
Since $s_i \notin S'$, $s_i \notin R_{(p,q)}$ for $x_p = s_{i-1}$ and $x_q = s_{i+1}$. In particular, it means that $|R_{(p,q)}| = 2h$.
As $|Z| \le 2h$ and $|R_{(p,q)}| = 2h$, there is $s_j \in R_{(p,q)}$ such that $s_j \notin Z$, i.e., $|\mathrm{overlap}(s_{j-1}, s_j)| =
|\mathrm{overlap}(s_j, s_{j+1})| = 0$. By the definition of $R_{(p,q)}$, $|\mathrm{overlap}(s_{i-1}, s_j)| + |\mathrm{overlap}(s_j, s_{i+1})| \ge
|\mathrm{overlap}(s_{i-1}, s_i)| + |\mathrm{overlap}(s_i, s_{i+1})|$. Consider $s^* = s_1 \circ \cdots \circ s_{i-1} \circ s_j \circ s_{i+1} \circ \cdots \circ s_{j-1} \circ s_i \circ s_{j+1} \circ \cdots \circ s_n$,
assuming that $i < j$ (the other case is similar). Because $|\mathrm{overlap}(s_{i-1}, s_j)| + |\mathrm{overlap}(s_j, s_{i+1})| \ge
|\mathrm{overlap}(s_{i-1}, s_i)| + |\mathrm{overlap}(s_i, s_{i+1})|$, $|s^*| \le |s|$. Moreover, since $s$ is a shortest superstring of $S$,
$|s| \le |s^*|$ and, therefore, $|\mathrm{overlap}(s_{j-1}, s_i)| = |\mathrm{overlap}(s_i, s_{j+1})| = 0$. But then for the set $Z^*$
constructed for $s^*$ in the same way as the set $Z$ for $s$, we obtain that $|Z^* \setminus S'| < |Z \setminus S'|$; a contradiction.

Case 2. $|\mathrm{overlap}(s_{i-1}, s_i)| = 0$ and $|\mathrm{overlap}(s_i, s_{i+1})| > 0$. Then $s_{i+1} \in X$. Since $s_i \notin S'$, $s_i \notin S_p$
for $x_p = s_{i+1}$ and $|S_p| = 2h$. As $|Z| \le 2h$ and $|S_p| = 2h$, there is $s_j \in S_p$ such that $s_j \notin Z$,
i.e., $|\mathrm{overlap}(s_{j-1}, s_j)| = |\mathrm{overlap}(s_j, s_{j+1})| = 0$. By the definition of $S_p$, $|\mathrm{overlap}(s_j, s_{i+1})| \ge
|\mathrm{overlap}(s_i, s_{i+1})|$. As in Case 1, consider $s^*$ obtained by exchanging $s_i$ and $s_j$ in the sequence of
strings that is used for the concatenations with overlaps. In the same way, we obtain a contradiction
with the choice of $Z$, because for the set $Z^*$ constructed for $s^*$ in the same way as the set $Z$ for $s$,
we obtain that $|Z^* \setminus S'| < |Z \setminus S'|$.

Case 3. $|\mathrm{overlap}(s_{i-1}, s_i)| > 0$ and $|\mathrm{overlap}(s_i, s_{i+1})| = 0$. To obtain a contradiction in this case, we
use the same arguments as in Case 2 using symmetry. Notice that we should consider $T_p$ instead of
$S_p$.

Now let $s' = s_{i_1} \circ \dots \circ s_{i_p}$, where $s_{i_1}, \dots, s_{i_p}$ is the sequence of strings of $S'$ obtained from $s_1, \dots, s_n$
by the deletion of the strings of $S \setminus S'$. Because we have that $Z \subseteq S'$, the overlap of each deleted
string with its neighbors is empty and, therefore, $s'$ has the same compression measure as $s$. □

To finish the construction of the kernel, we define $\ell' = \ell - \sum_{x \in S \setminus S'} |x|$ and apply the following
rule that is safe by Claim (*).

Rule 6. If $\ell' < 0$, then return a no-answer and stop. Otherwise, return the instance $(S', \ell')$ and
stop.
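
Rule 6 can be illustrated by the following minimal sketch. The function name `apply_rule_6` and the representation of the instance as Python lists of strings are assumptions made here for illustration only; the set S' is taken as given and is not computed.

```python
def apply_rule_6(S, S_prime, ell):
    """Hypothetical helper sketching Rule 6.

    S       -- list of all strings of the reduced instance
    S_prime -- the sub-collection S' kept for the kernel (assumed given)
    ell     -- the length bound of the original instance

    Returns None for a no-answer, otherwise the pair (S', ell').
    """
    kept = set(S_prime)
    # Total length of the strings of S \ S' that are deleted.
    removed_length = sum(len(x) for x in S if x not in kept)
    ell_prime = ell - removed_length
    if ell_prime < 0:
        return None  # no-answer
    return (list(S_prime), ell_prime)
```

For instance, deleting two strings of total length 4 from an instance with bound 5 leaves the bound 1, while the same deletion with bound 3 yields a no-answer.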

Since $|X| = h \le 2(r - 1)$, $|S'| \le h + h^2 \cdot 2h + h \cdot 2h + h \cdot 2h = 2h^3 + 4h^2 + h = O(h^3) = O(r^3)$.
Because each string of $S'$ has length at most $2r$, the kernel has size $O(r^4)$.

It is easy to see that Rules 1-3 can be applied in polynomial time. Then the graph $G$ and $M$ can be
constructed in polynomial time and, trivially, Rule 5 demands $O(1)$ time. The sets $X$, $Y$, $R_{(i,j)}$, $S_i$
and $T_j$ can be constructed in polynomial time. Hence, $S'$ and $\ell'$ can be constructed in polynomial
time. Because Rule 6 can be applied in time $O(1)$, we conclude that the kernel is constructed in
polynomial time. □

Now we consider another upper bound for the length of the shortest superstring. Let $S$ be a collection
of strings. We construct an auxiliary weighted graph $G(S)$ with the vertex set $S$ by assigning
the weight $w(\{x, y\}) = \max\{|\mathrm{overlap}(x, y)|, |\mathrm{overlap}(y, x)|\}$ for any two distinct $x, y \in S$. Let $\mu(S)$
be the size of a maximum weighted matching in $G(S)$. Clearly, $G(S)$ can be constructed in polynomial
time and the computation of $\mu(S)$ is well known to be polynomial [7]. If $M = \{\{x_1, y_1\}, \dots, \{x_h, y_h\}\}$
and $|\mathrm{overlap}(x_i, y_i)| = w(\{x_i, y_i\})$ for $i \in \{1, \dots, h\}$, then the string $s$ obtained by the consecutive
concatenations with overlaps of $x_1, y_1, \dots, x_h, y_h$ and then (possibly) the remaining strings of $S$ has
the compression measure at least $\mu(S)$. Hence, $\sum_{x \in S} |x| - \mu(S)$ is an upper bound for the length
of the shortest superstring of $S$. We show that it is NP-hard to find a superstring that is shorter
than this bound.
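
The matching bound can be made concrete with the following minimal sketch. It is an illustration only: the function names are hypothetical, the overlap is taken to be the longest proper suffix of the first string that is a prefix of the second, and the maximum weight matching is found by exhaustive search, which is exponential and suitable only for tiny examples (the polynomial algorithm referred to in the text is not implemented here).

```python
def overlap_len(a, b):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def max_weight_matching_value(strings):
    """Brute-force value of mu(S): maximum weight of a matching in G(S)."""
    def best(remaining):
        if len(remaining) < 2:
            return 0
        first, rest = remaining[0], remaining[1:]
        # Option 1: leave `first` unmatched.
        result = best(rest)
        # Option 2: match `first` with some other string.
        for idx, other in enumerate(rest):
            w = max(overlap_len(first, other), overlap_len(other, first))
            result = max(result, w + best(rest[:idx] + rest[idx + 1:]))
        return result

    return best(tuple(strings))


def matching_upper_bound(strings):
    """Sum of lengths minus mu(S): an upper bound on the shortest superstring length."""
    return sum(len(x) for x in strings) - max_weight_matching_value(strings)


example = ["abcd", "cdef", "xyzw", "zwuv"]
print(max_weight_matching_value(example), matching_upper_bound(example))
```

In this example the pairs {abcd, cdef} and {xyzw, zwuv} each have weight 2, so the matching has weight 4 and the bound is 16 - 4 = 12, attained by the superstring abcdefxyzwuv.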
Theorem 5. Shortest Superstring is NP-complete for $\ell = \sum_{x \in S} |x| - \mu(S) - 1$ even if restricted
to the alphabet $\Sigma = \{0, 1\}$.

Proof. We reduce Long Trail that was shown to be NP-complete in Lemma 2 for $\ell = |V(G)| - 1$.
Let $(G, \ell)$ be an instance of the problem, $n = |V(G)| = \ell + 1$. We assume that $n \ge 2^6 = 64$. Let
$V(G) = \{v_1, \dots, v_n\}$ and $E(G) = \{e_1, \dots, e_m\}$. Let also $p = \lceil (n - 1)/3 \rceil$ and $q = n - 1 - 2p$. Denote
by $z$ = '01...1' and $z^*$ = '1...1' the strings of length $p$ such that the first symbol of $z$ is '0' and all
the other symbols are '1'-s and $z^*$ is a string of '1'-s. For a positive integer $i \le 2^{q-1} - 1$, denote by
$x_i$ the string of length $q - 1$ that encodes $i$ in binary and by $y_i$ the string of length $q$ that encodes
$2i$. Notice that $q \ge n/3 - 4$ and $\log n^2 \le q - 3$, because $n \ge 2^6$. Hence, the first symbols of $x_i$ and
$y_i$ are '0' if $i \le n^2$. Observe also that the last symbol of each $y_i$ is '0'. For each $h \in \{1, \dots, m\}$, we
consider the arc $e_h = (v_i, v_j)$ of $G$ and construct two strings:

• $s_h = z\, y_h\, z^*\, z\, x_i\, z^*\, z\, x_j\, z^*$,

• $s'_h = z\, x_i\, z^*\, z\, x_j\, z^*\, z\, y_h\, z^*$.

We define $S = \{s_h, s'_h \mid 1 \le h \le m\}$.

We need the following properties of the strings of S. 

i) For $h \in \{1, \dots, m\}$, $|\mathrm{overlap}(s_h, s'_h)| = 2(n - 2)$ and $|\mathrm{overlap}(s'_h, s_h)| = n - 1$.

ii) For distinct $h, h' \in \{1, \dots, m\}$, $|\mathrm{overlap}(s_h, s'_{h'})| = n - 2$ if the head of $e_h$ coincides with the
tail of $e_{h'}$ and $|\mathrm{overlap}(s_h, s'_{h'})| = 0$ otherwise.

iii) For distinct $h, h' \in \{1, \dots, m\}$, $|\mathrm{overlap}(s_h, s_{h'})| = |\mathrm{overlap}(s'_h, s'_{h'})| = |\mathrm{overlap}(s'_h, s_{h'})| = 0$.

These properties immediately follow from the definition of $s_h, s'_h$ and the facts that $|z| = |z^*| \ge
|y_h| = |x_i| + 1 = |x_j| + 1$, the strings $z, y_h, x_i, x_j$ start with '0', the last symbol of $y_h$ is '0', and
$z$ = '01...1', $z^*$ = '1...1'. It is sufficient to notice that if the overlap of two strings is not empty,
then the prefix and the suffix of length $p$ of the overlap are always $z$ and $z^*$ respectively.
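
Properties (i)-(iii) can be checked mechanically on a small instance. The sketch below is an illustration under assumptions, not part of the proof: the function names, the 64-vertex digraph with four arcs and no loops or parallel arcs, and the choice of the longest proper suffix-prefix match as the overlap are all introduced here for the example.

```python
def overlap_len(a, b):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_strings(n, arcs):
    """Build s_h and s'_h for arcs e_h = (v_i, v_j), given as 1-based pairs (i, j)."""
    p = -(-(n - 1) // 3)              # p = ceil((n - 1) / 3)
    q = n - 1 - 2 * p
    z = "0" + "1" * (p - 1)           # z  = '01...1' of length p
    z_star = "1" * p                  # z* = '1...1'  of length p

    def x(i):                         # i in binary, padded to length q - 1
        return format(i, "b").zfill(q - 1)

    def y(h):                         # 2h in binary, padded to length q
        return format(2 * h, "b").zfill(q)

    s, s_prime = {}, {}
    for h, (i, j) in enumerate(arcs, start=1):
        s[h] = z + y(h) + z_star + z + x(i) + z_star + z + x(j) + z_star
        s_prime[h] = z + x(i) + z_star + z + x(j) + z_star + z + y(h) + z_star
    return s, s_prime


n = 64                                     # the proof assumes n >= 64
arcs = [(1, 2), (2, 3), (3, 1), (5, 6)]    # hypothetical example digraph
s, s_prime = build_strings(n, arcs)

for h in s:
    # Property (i).
    assert overlap_len(s[h], s_prime[h]) == 2 * (n - 2)
    assert overlap_len(s_prime[h], s[h]) == n - 1
    for g in s:
        if g == h:
            continue
        # Property (ii): head of e_h equals tail of e_g.
        chained = arcs[h - 1][1] == arcs[g - 1][0]
        assert overlap_len(s[h], s_prime[g]) == (n - 2 if chained else 0)
        # Property (iii).
        assert overlap_len(s[h], s[g]) == 0
        assert overlap_len(s_prime[h], s_prime[g]) == 0
        assert overlap_len(s_prime[h], s[g]) == 0
```

Each constructed string has length 6p + 3q - 2 = 3n - 5, which is 187 for n = 64.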

Now we consider the weighted graph $G(S)$ and observe that $M = \{\{s_h, s'_h\} \mid 1 \le h \le m\}$ is a
maximum weight matching in $G(S)$ and $\mu(S) = 2(n - 2)m$ by (i)-(iii).

We claim that $G$ has a trail of length at least $\ell = n - 1$ if and only if $S$ has a superstring of
length at most $\ell' = \sum_{x \in S} |x| - \mu(S) - 1$.

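Suppose that the sequence of arcs $e_{i_1}, \dots, e_{i_\ell}$ composes a trail in $G$. Let $\{e_{j_1}, \dots, e_{j_{m-\ell}}\} =
E(G) \setminus \{e_{i_1}, \dots, e_{i_\ell}\}$. Consider

$$s = s'_{i_1} \circ s_{i_1} \circ \dots \circ s'_{i_\ell} \circ s_{i_\ell} \circ s_{j_1} \circ s'_{j_1} \circ \dots \circ s_{j_{m-\ell}} \circ s'_{j_{m-\ell}}.$$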

Since $|\mathrm{overlap}(s'_{i_h}, s_{i_h})| = n - 1$ for $h \in \{1, \dots, \ell\}$ by (i), $|\mathrm{overlap}(s_{i_{h-1}}, s'_{i_h})| = n - 2$ for $h \in \{2, \dots, \ell\}$
by (ii) and $|\mathrm{overlap}(s_{j_h}, s'_{j_h})| = 2(n - 2)$ by (i), the compression measure of $s$ is $t = (n - 1)\ell + (n -
2)(\ell - 1) + 2(n - 2)(m - \ell)$ and $t - \mu(S) = (n - 1)\ell - (n - 2)(\ell + 1) = 1$. Hence, $s$ is a superstring
of $S$ of length at most $\ell' = \sum_{x \in S} |x| - \mu(S) - 1$.
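
The arithmetic behind this step can be spelled out as follows. The derivation only uses $\mu(S) = 2(n-2)m$ and $\ell = n - 1$, both stated in the proof; it is a worked expansion added for readability.

```latex
\begin{align*}
t - \mu(S) &= (n-1)\ell + (n-2)(\ell-1) + 2(n-2)(m-\ell) - 2(n-2)m \\
           &= (n-1)\ell + (n-2)(\ell-1) - 2(n-2)\ell \\
           &= (n-1)\ell - (n-2)(\ell+1) \\
           &= \ell - (n-2) \\
           &= (n-1) - (n-2) = 1 .
\end{align*}
```

Consequently $|s| = \sum_{x \in S} |x| - t = \sum_{x \in S} |x| - \mu(S) - 1$.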

Assume that $s$ is a shortest superstring of $S$ and $|s| \le \ell' = \sum_{x \in S} |x| - \mu(S) - 1$. By Lemma 1, we can assume that $s$
is obtained from a sequence $\sigma$ of the strings of $S$ by the concatenations with overlaps.

We show that for every $h \in \{1, \dots, m\}$, either $s_h, s'_h$ or $s'_h, s_h$ are consecutive in $\sigma$. To obtain a
contradiction, assume first that for some $h \in \{1, \dots, m\}$, $s_h$ occurs in $\sigma$ before $s'_h$ but these strings
are not consecutive. Let $a$ be the predecessor of $s_h$, $b$ be the predecessor of $s'_h$ and $c$ be the successor of
$s'_h$ in $\sigma$; if $s_h$ is the first element of $\sigma$ or $s'_h$ is the last element, we assume that $a$ or $c$ is the empty
string respectively. Then $|\mathrm{overlap}(a, s_h)| = |\mathrm{overlap}(s'_h, c)| = 0$ by (iii) and $|\mathrm{overlap}(b, s'_h)| \le n - 2$
by (ii) and (iii). Consider the sequence $\sigma'$ obtained from $\sigma$ by the placement of $s'_h$ between $a$ and $s_h$.
Because $|\mathrm{overlap}(s'_h, s_h)| = n - 1$ by (i), the string $s'$ obtained from $\sigma'$ by the concatenations with
overlaps has length at most $|s| - 1$; a contradiction. Suppose now that for some $h \in \{1, \dots, m\}$,
$s'_h$ occurs in $\sigma$ before $s_h$ but these strings are not consecutive. Let $a$ be the successor of $s'_h$, $b$ be the
predecessor of $s_h$ and $c$ be the successor of $s_h$ in $\sigma$; if $s_h$ is the last element of $\sigma$, we assume that $c$ is the
empty string. We have that $|\mathrm{overlap}(s'_h, a)| = |\mathrm{overlap}(b, s_h)| = 0$ by (iii) and $|\mathrm{overlap}(s_h, c)| \le n - 2$
by (ii) and (iii). Consider the sequence $\sigma'$ obtained from $\sigma$ by the placement of $s_h$ between $s'_h$ and
$a$. Because $|\mathrm{overlap}(s'_h, s_h)| = n - 1$ by (i), the string $s'$ obtained from $\sigma'$ by the concatenations
with overlaps has length at most $|s| - 1$; a contradiction.

We decompose $\sigma$ into inclusion maximal subsequences $\sigma_1, \dots, \sigma_r$ such that the overlap between
any two consecutive strings in each subsequence is not empty. Because either $s_h, s'_h$ or $s'_h, s_h$ are
consecutive in $\sigma$ for $h \in \{1, \dots, m\}$ and $|\mathrm{overlap}(s_h, s'_h)| = 2(n - 2)$ and $|\mathrm{overlap}(s'_h, s_h)| = n - 1$ by
(i), each pair $s_h, s'_h$ is in the same subsequence. In particular, it means that the number of elements
in each subsequence is even. Let $n_i$ be the size of $\sigma_i$ and let $w_i$ be the string obtained by the
concatenation with overlaps from $\sigma_i$ for $i \in \{1, \dots, r\}$. Because $n_1 + \dots + n_r = 2m$, $|M| = m$ and
the compression measure of $s$ is at least $\mu(S) + 1$, there is $i \in \{1, \dots, r\}$ such that the compression
measure $\alpha$ of $w_i$ is at least $n_i/2 \cdot \mu(S)/m + 1 = n_i(n - 2) + 1$.

Suppose that $s_h, s'_h$ occur in this order in $\sigma_i$ for some $h \in \{1, \dots, m\}$. Then they are consecutive. If $s_h$ has a
predecessor $a$ in $\sigma$, then $|\mathrm{overlap}(a, s_h)| = 0$, and if $s'_h$ has a successor $b$ in $\sigma$, then $|\mathrm{overlap}(s'_h, b)| = 0$
by (iii). Hence $\sigma_i = s_h, s'_h$ and $n_i = 2$ in this case, but then by (i), the compression measure of $w_i$ is $\alpha = 2(n - 2) < n_i(n - 2) + 1$; a
contradiction. It follows that $w_i = s'_{i_1} \circ s_{i_1} \circ \dots \circ s'_{i_k} \circ s_{i_k}$, where $i_1, \dots, i_k \in \{1, \dots, m\}$ are distinct and $k =
n_i/2$. Since for $j \in \{2, \dots, k\}$, the overlap between $s_{i_{j-1}}$ and $s'_{i_j}$ is not empty, $|\mathrm{overlap}(s_{i_{j-1}}, s'_{i_j})| =
n - 2$ and the head of the arc $e_{i_{j-1}}$ is the tail of $e_{i_j}$. Hence, $e_{i_1}, \dots, e_{i_k}$ is a trail in $G$. By (i) and
(ii), we have that $\alpha = k(n - 1) + (k - 1)(n - 2) \ge 2k(n - 2) + 1$. Therefore, $k \ge n - 1$, i.e., $G$ has
a trail of length at least $\ell = n - 1$. □